Introduction

Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.

Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6.

set.seed(123)
N <- runif(1, 8, 1000)
n <- 10000
X <- runif(n, min = 0, max = N)

Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(μ=σ=(N+1)/2\).

set.seed(123)
me <- (N+1)/2
sd <- me
Y <- rnorm(n, mean = me, sd = sd)

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

(x <- median(X))
## [1] 145.0453
(y <- quantile(Y, 0.25)[[1]])
## [1] 48.85452
    1. P(X>x | X>y)

\[P(X>x | X>y) = \frac{P(X>145.0453) \quad& P(X>48.85452) }{P(X>48.85452)}\]

(p_a <- sum(X>x & X>y)/n)
## [1] 0.5
(p_xy <- sum(X>y)/n) 
## [1] 0.8318
(p_a/p_xy)
## [1] 0.601106

0.6 is the probability that X is greater than its median given that X is greater than the first quartile of Y.

    1. P(X>x , Y>y)

\[P(X>x , X>y) = {P(X>145.0453) . P(Y>48.85452) }\]

(p_b <- sum(X>x & Y>y)/n)
## [1] 0.369

0.36 is the probability that X and Y are greater than all possible x and y.

    1. P(X<x | X>y)

\[P(X<x | X>y) = \frac{P(X<145.0453) . P(X>48.85452) }{P(X>48.85452)}\]

(p_c <- sum(X<x & X>y)/n)
## [1] 0.3318
(p_xy <- sum(X>y)/n) 
## [1] 0.8318
(p_c/p_xy)
## [1] 0.398894

0.39 is the probability of X less than its median and greater than the first quantile of Y.

Investigate whether \(P(X>x \quad and \quad Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.

(res <- matrix(c(sum(X>x & Y<y),sum(X>x & Y>y), sum(X<x & Y<y),sum(X<x & Y>y)),  ncol = 2, nrow = 2))
##      [,1] [,2]
## [1,] 1310 1190
## [2,] 3690 3810
res <- cbind(res,c(res[1,1] + res[1,2], res[2,1] + res[2,2]))
res <- rbind(res,c(res[1,1] + res[2,1], res[1,2] + res[2,2], res[1,3] + res[2,3]))
(results <- as.data.frame(res))
results
colnames(results) <- c("X>x", "X<x", "total")
rownames(results)  <- c("Y<y", "Y>y", "total")

results

Probability matrix

(prob_tab <- results/n)

Check \(P(X>x \quad and \quad Y>y)=P(X>x)P(Y>y)\) Check the right side : \(P(X>x)P(Y>y)\) from the table we get

(round(0.5*0.75, 2))
## [1] 0.38

Check the leftside: \(P(X>x \quad and \quad Y>y)\) from the table = 0.369 ~ 0.38

Since the results are so similar we can conclude that both X and Y are independent variable

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

fisher.test(results,simulate.p.value=TRUE)
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based
##  on 2000 replicates)
## 
## data:  results
## p-value = 0.1079
## alternative hypothesis: two.sided
chisq.test(results, correct=TRUE)
## 
##  Pearson's Chi-squared test
## 
## data:  results
## X-squared = 7.68, df = 4, p-value = 0.104

“Fisher’s exact test” is a test for comparing two proportions that is conditional on the marginal frequencies. The test is exact in guaranteeing that the type I error is less than or equal to what you specify. It can however be much less. P-values are too large and power is lost. In this sense ordinary chi-square tests are generally “more accurate” than “exact” tests. In addition, the old adage that chi-square tests are not accurate if an expected cell frequency is less than 5 is not correct.

Problem 2

library('ggplot2')
library('ggthemes') 
library('scales')
library('dplyr') 
library('mice')
library('randomForest') 
library('data.table')
library('gridExtra')
library('corrplot') 
library('GGally')
library('e1071')

loading dataset

library(RCurl)
## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:mice':
## 
##     complete
train <- read.csv('https://raw.githubusercontent.com/salma71/houses_prices_kaggle/master/datasets/train.csv', stringsAsFactors = F)
head(train)
colnames(train)
##  [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"  
##  [5] "LotArea"       "Street"        "Alley"         "LotShape"     
##  [9] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
## [13] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
## [17] "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"    
## [21] "YearRemodAdd"  "RoofStyle"     "RoofMatl"      "Exterior1st"  
## [25] "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
## [29] "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"     
## [33] "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2" 
## [37] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "Heating"      
## [41] "HeatingQC"     "CentralAir"    "Electrical"    "X1stFlrSF"    
## [45] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
## [49] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
## [53] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
## [57] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"  
## [61] "GarageFinish"  "GarageCars"    "GarageArea"    "GarageQual"   
## [65] "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"  
## [69] "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"   "PoolArea"     
## [73] "PoolQC"        "Fence"         "MiscFeature"   "MiscVal"      
## [77] "MoSold"        "YrSold"        "SaleType"      "SaleCondition"
## [81] "SalePrice"

Identify the target variable which is the SalePrice

target <- train$SalePrice
summary(target)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

Univariate analysis

dim(train)
## [1] 1460   81
str(train)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
sum(sapply(train[,1:81], typeof) == "character")
## [1] 43
#Count the number of columns that consists of numerical data

sum(sapply(train[,1:81], typeof) == "integer")
## [1] 38

We have 43 columns that consist of text and 38 columns are numerical.let’s take a look at some descriptive statistics

# Obtain summary statistics

summary(train[,sapply(train[,1:81], typeof) == "integer"])
##        Id           MSSubClass     LotFrontage        LotArea      
##  Min.   :   1.0   Min.   : 20.0   Min.   : 21.00   Min.   :  1300  
##  1st Qu.: 365.8   1st Qu.: 20.0   1st Qu.: 59.00   1st Qu.:  7554  
##  Median : 730.5   Median : 50.0   Median : 69.00   Median :  9478  
##  Mean   : 730.5   Mean   : 56.9   Mean   : 70.05   Mean   : 10517  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   3rd Qu.: 80.00   3rd Qu.: 11602  
##  Max.   :1460.0   Max.   :190.0   Max.   :313.00   Max.   :215245  
##                                   NA's   :259                      
##   OverallQual      OverallCond      YearBuilt     YearRemodAdd 
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967  
##  Median : 6.000   Median :5.000   Median :1973   Median :1994  
##  Mean   : 6.099   Mean   :5.575   Mean   :1971   Mean   :1985  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##                                                                
##    MasVnrArea       BsmtFinSF1       BsmtFinSF2        BsmtUnfSF     
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:   0.0   1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 223.0  
##  Median :   0.0   Median : 383.5   Median :   0.00   Median : 477.5  
##  Mean   : 103.7   Mean   : 443.6   Mean   :  46.55   Mean   : 567.2  
##  3rd Qu.: 166.0   3rd Qu.: 712.2   3rd Qu.:   0.00   3rd Qu.: 808.0  
##  Max.   :1600.0   Max.   :5644.0   Max.   :1474.00   Max.   :2336.0  
##  NA's   :8                                                           
##   TotalBsmtSF       X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  Min.   :   0.0   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  1st Qu.: 795.8   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##  Median : 991.5   Median :1087   Median :   0   Median :  0.000  
##  Mean   :1057.4   Mean   :1163   Mean   : 347   Mean   :  5.845  
##  3rd Qu.:1298.2   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##  Max.   :6110.0   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                                  
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr    TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 5.000  
##  Median :0.0000   Median :3.000   Median :1.000   Median : 6.000  
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   Mean   : 6.518  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000   Max.   :14.000  
##                                                                   
##    Fireplaces     GarageYrBlt     GarageCars      GarageArea    
##  Min.   :0.000   Min.   :1900   Min.   :0.000   Min.   :   0.0  
##  1st Qu.:0.000   1st Qu.:1961   1st Qu.:1.000   1st Qu.: 334.5  
##  Median :1.000   Median :1980   Median :2.000   Median : 480.0  
##  Mean   :0.613   Mean   :1979   Mean   :1.767   Mean   : 473.0  
##  3rd Qu.:1.000   3rd Qu.:2002   3rd Qu.:2.000   3rd Qu.: 576.0  
##  Max.   :3.000   Max.   :2010   Max.   :4.000   Max.   :1418.0  
##                  NA's   :81                                     
##    WoodDeckSF      OpenPorchSF     EnclosedPorch      X3SsnPorch    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00  
##  Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41  
##  3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##  Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00  
##                                                                     
##   ScreenPorch        PoolArea          MiscVal             MoSold      
##  Min.   :  0.00   Min.   :  0.000   Min.   :    0.00   Min.   : 1.000  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.: 5.000  
##  Median :  0.00   Median :  0.000   Median :    0.00   Median : 6.000  
##  Mean   : 15.06   Mean   :  2.759   Mean   :   43.49   Mean   : 6.322  
##  3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.: 8.000  
##  Max.   :480.00   Max.   :738.000   Max.   :15500.00   Max.   :12.000  
##                                                                        
##      YrSold       SalePrice     
##  Min.   :2006   Min.   : 34900  
##  1st Qu.:2007   1st Qu.:129975  
##  Median :2008   Median :163000  
##  Mean   :2008   Mean   :180921  
##  3rd Qu.:2009   3rd Qu.:214000  
##  Max.   :2010   Max.   :755000  
## 

As we can see from the summary statistics above, there are some outliers needs to be proceessed. It would be clear to identify which dependant variables that need processing after visualization.

# The percentage of data missing in train

cat(sprintf("Percentage of missing values in the overall train dataset: %s%s\n", round(length(which(is.na(train) == TRUE)) * 100 / (nrow(train) * ncol(train)), 2), "%"))
## Percentage of missing values in the overall train dataset: 5.89%

Visualization

Creating one training dataset with categorical variable and one with numeric variable. We will use this for data visualization.

cat_var <- names(train)[which(sapply(train, is.character))]
cat_car <- c(cat_var, 'BedroomAbvGr', 'HalfBath', ' KitchenAbvGr','BsmtFullBath', 'BsmtHalfBath', 'MSSubClass')
numeric_var <- names(train)[which(sapply(train, is.numeric))]
train1_cat<-train[cat_var]
train1_num<-train[numeric_var]
library(ggplot2)
## Bar plot/Density plot function

## Bar plot function

plotHist <- function(data_in, i) 
{
  data <- data.frame(x=data_in[[i]])
  p <- ggplot(data=data, aes(x=factor(x))) + stat_count() + xlab(colnames(data_in)[i]) + theme_light() + 
    theme(axis.text.x = element_text(angle = 90, hjust =1))
  return (p)
}

## Density plot function

plotDen <- function(data_in, i){
  data <- data.frame(x=data_in[[i]], SalePrice = data_in$SalePrice)
  p <- ggplot(data= data) + geom_line(aes(x = x), stat = 'density', size = 1,alpha = 1.0) +
    xlab(paste0((colnames(data_in)[i]), '\n', 'Skewness: ',round(skewness(data_in[[i]], na.rm = TRUE), 2))) + theme_light() 
  return(p)
  
}

## Function to call both Bar plot and Density plot function

doPlots <- function(data_in, fun, ii, ncol=3) 
{
  pp <- list()
  for (i in ii) {
    p <- fun(data_in=data_in, i=i)
    pp <- c(pp, list(p))
  }
  do.call("grid.arrange", c(pp, ncol=ncol))
}

The bar plots below offer more insight into the data. MSZoning: bar plot indicates that majority of the houses are located in low density residential areas and medium density residential area.

The type of road access to the property tends to be paved and the houses do not have alleys.

Landcontour: the houses are built on flat properties Utilities: Almost all homes have all public utilities (E,G,W, & S) LandSlope: most of the properties have a gentle slope

## Barplots for the categorical features

doPlots(train1_cat, fun = plotHist, ii = 1:11, ncol = 4)

doPlots(train1_cat, fun = plotHist, ii = 12:22, ncol = 4)

The houses that have sever landslope are located in the Clear Creek and Timberland. The houses with moderate landslope are present in more neighborhood. The Clear Creek and the Crawford neighborhoods seem to have high slopes.

train %>% 
  select(LandSlope, Neighborhood, SalePrice) %>% 
  filter(LandSlope == c('Sev', 'Mod')) %>% 
  arrange(Neighborhood) %>% 
  group_by(Neighborhood, LandSlope) %>% 
  summarize(Count = n()) %>% 
  ggplot(aes(Neighborhood, Count)) + 
  geom_bar(aes(fill = LandSlope), position = 'dodge', stat = 'identity') + 
  theme_light() + 
  theme(axis.text.x = element_text(angle = 90, hjust =1))

ggplot(train, aes(x = Neighborhood, y = SalePrice)) +
  geom_boxplot() +
  geom_hline(aes(yintercept=80), 
             colour='red', linetype='dashed', lwd=2) +
  scale_y_continuous(labels=dollar_format()) +
  theme(axis.text.x = element_text(angle = 90))

Boxplot between the neighboorhoods and sale price shows that BrookSide and South & West of Iowa State University have cheap houses. While Northridge and Northridge Heights are rich neighborhoods with several outliers in terms of price.

doPlots(train1_num, fun = plotDen, ii = 2:23, ncol = 4)
## Warning: Removed 259 rows containing non-finite values (stat_density).
## Warning: Removed 8 rows containing non-finite values (stat_density).

library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:randomForest':
## 
##     outlier
## The following objects are masked from 'package:scales':
## 
##     alpha, rescale
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
correlations <- cor(na.omit(train1_num[,-1])) # omit the id
row_indic <- apply(correlations, 1, function(x) sum(x > 0.3 | x < -0.3) > 1)
correlations<- correlations[row_indic ,row_indic ]
corrplot(correlations, method="square")

ggcorr(correlations, palette = "RdBu", label = TRUE)

# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat<- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(train1_num[,-1])
(p.mat[, 1:5])
##                 MSSubClass  LotFrontage      LotArea   OverallQual
## MSSubClass    0.000000e+00 4.846634e-44 8.194126e-08  2.127766e-01
## LotFrontage   4.846634e-44 0.000000e+00 3.697417e-54  8.391777e-19
## LotArea       8.194126e-08 3.697417e-54 0.000000e+00  5.105509e-05
## OverallQual   2.127766e-01 8.391777e-19 5.105509e-05  0.000000e+00
## OverallCond   2.342025e-02 4.019570e-02 8.296279e-01  4.361720e-04
## YearBuilt     2.875793e-01 1.813802e-05 5.869923e-01 8.400355e-128
## YearRemodAdd  1.211631e-01 2.052217e-03 5.985903e-01 1.514450e-116
## MasVnrArea    3.824712e-01 1.536390e-11 6.995925e-05  1.487730e-60
## BsmtFinSF1    7.598820e-03 2.364202e-16 1.340256e-16  1.610533e-20
## BsmtFinSF2    1.210779e-02 8.388577e-02 2.068606e-05  2.388567e-02
## BsmtUnfSF     6.637589e-08 3.979767e-06 9.203746e-01  1.732033e-33
## TotalBsmtSF   2.473278e-20 2.046592e-45 3.911258e-24 3.130271e-110
## X1stFlrSF     1.526459e-22 4.498144e-63 1.234238e-31  1.623197e-83
## X2ndFlrSF     1.985216e-33 5.433170e-03 5.144266e-02  8.317033e-31
## LowQualFinSF  7.586521e-02 1.827783e-01 8.552305e-01  2.452458e-01
## GrLivArea     4.213734e-03 4.612423e-48 1.520481e-24 2.247511e-139
## BsmtFullBath  8.939735e-01 4.587930e-04 1.230800e-09  2.094407e-05
## BsmtHalfBath  9.290423e-01 8.022392e-01 6.646033e-02  1.251676e-01
## FullBath      4.502112e-07 3.634485e-12 1.360045e-06 1.669019e-116
## HalfBath      8.788166e-12 6.365788e-02 5.861562e-01  1.872117e-26
## BedroomAbvGr  3.708290e-01 1.780460e-20 4.519871e-06  9.949911e-05
## KitchenAbvGr  4.852405e-28 8.335889e-01 4.971427e-01  1.435624e-12
## TotRmsAbvGrd  1.230183e-01 2.253852e-36 2.460712e-13  6.425172e-66
## Fireplaces    8.175110e-02 5.373474e-21 4.632098e-26  3.086481e-56
## GarageYrBlt   1.566879e-03 1.834203e-02 3.545896e-01 8.371655e-109
## GarageCars    1.255479e-01 5.428777e-24 2.706813e-09 7.037163e-144
## GarageArea    1.592242e-04 6.734807e-35 3.802721e-12 2.434288e-122
## WoodDeckSF    6.310397e-01 2.136524e-03 4.001008e-11  2.126361e-20
## OpenPorchSF   8.158485e-01 1.210210e-07 1.185832e-03  1.245035e-33
## EnclosedPorch 6.458454e-01 7.110477e-01 4.837906e-01  1.276996e-05
## X3SsnPorch    9.414954e-02 1.520968e-02 4.355273e-01  2.461582e-01
## ScreenPorch   3.202578e-01 1.517836e-01 9.924777e-02  1.314582e-02
## PoolArea      7.518384e-01 5.393081e-13 2.979905e-03  1.275642e-02
## MiscVal       7.692691e-01 9.071905e-01 1.459891e-01  2.304122e-01
## MoSold        6.040063e-01 6.982030e-01 9.633078e-01  6.790956e-03
## YrSold        4.137257e-01 7.964813e-01 5.861053e-01  2.963852e-01
## SalePrice     1.266472e-03 2.602442e-36 1.123139e-24 2.185675e-313
##                OverallCond
## MSSubClass    2.342025e-02
## LotFrontage   4.019570e-02
## LotArea       8.296279e-01
## OverallQual   4.361720e-04
## OverallCond   0.000000e+00
## YearBuilt     3.090962e-50
## YearRemodAdd  4.816090e-03
## MasVnrArea    9.716412e-07
## BsmtFinSF1    7.741099e-02
## BsmtFinSF2    1.244256e-01
## BsmtUnfSF     1.530090e-07
## TotalBsmtSF   4.685261e-11
## X1stFlrSF     3.126348e-08
## X2ndFlrSF     2.690898e-01
## LowQualFinSF  3.303245e-01
## GrLivArea     2.311138e-03
## BsmtFullBath  3.580810e-02
## BsmtHalfBath  6.367488e-06
## FullBath      7.242238e-14
## HalfBath      2.022485e-02
## BedroomAbvGr  6.202024e-01
## KitchenAbvGr  8.754658e-04
## TotRmsAbvGrd  2.779335e-02
## Fireplaces    3.630812e-01
## GarageYrBlt   3.914853e-35
## GarageCars    8.425076e-13
## GarageArea    5.945338e-09
## WoodDeckSF    8.987253e-01
## OpenPorchSF   2.133227e-01
## EnclosedPorch 7.159420e-03
## X3SsnPorch    3.301473e-01
## ScreenPorch   3.625216e-02
## PoolArea      9.395944e-01
## MiscVal       8.568159e-03
## MoSold        8.933753e-01
## YrSold        9.321243e-02
## SalePrice     2.912351e-03
# Leave blank on no significant coefficient
corrplot(correlations, type="upper", order="hclust", 
         p.mat = p.mat, sig.level = 0.05, method = 'pie' ,insig = "blank")

scatterplot for the most significance variables with the target(SalesPrice)

plotCorr <- function(data_in, i){
  data <- data.frame(x = data_in[[i]], SalePrice = data_in$SalePrice)
  p <- ggplot(data, aes(x = x, y = SalePrice)) + 
    geom_point(shape = 1, na.rm = TRUE) + 
    geom_smooth(method = lm ) + 
    xlab(paste0(colnames(data_in)[i], '\n', 'R-Squared: ', round(cor(data_in[[i]], data$SalePrice, use = 'complete.obs'), 2))) + 
    theme_light()
  return(suppressWarnings(p))
}


highcorr <- c(names(correlations[,'SalePrice'])[which(correlations[,'SalePrice'] > 0.5)], names(correlations[,'SalePrice'])[which(correlations[,'SalePrice'] < -0.2)])
 
data_corr <- train[,highcorr]


doPlots(data_corr, fun = plotCorr, ii = 1:11)

The histogram for the response variable SalePrice shows that it is skewed. Taking the log of the variable normalizes it.

library(scales)
ggplot(train, aes(x=SalePrice)) + 
  geom_histogram(col = 'white') + 
  theme_light() + 
  scale_x_continuous(labels = comma)

#Normalize distribution
ggplot(train, aes(x=log(SalePrice+1))) + 
  geom_histogram(col = 'white') + 
  theme_light()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

signif_col <- train1_num %>%
  select(YearBuilt, FullBath, OverallQual, GrLivArea, TotRmsAbvGrd, X2ndFlrSF, SalePrice)
# options(repr.plot.width = 20, repr.plot.height = 20)
ggpairs(signif_col)

The correlation plot

corr <- train1_num %>%
  select(SalePrice, TotRmsAbvGrd, GrLivArea)
ggcorr(tidyr::drop_na(corr), palette = "RdBu", label = TRUE)

Test the hypotheses:

  • H0: The correlation is zero.
  • H1: The correlation is not zero.
  • Significance level is 0.05.
(tot_vs_sal <- cor.test(corr$TotRmsAbvGrd, corr$SalePrice, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  corr$TotRmsAbvGrd and corr$SalePrice
## t = 24.099, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4960020 0.5694337
## sample estimates:
##       cor 
## 0.5337232
(gr_vs_tot <- cor.test(corr$GrLivArea, corr$TotRmsAbvGrd, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  corr$GrLivArea and corr$TotRmsAbvGrd
## t = 55.846, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8084234 0.8411686
## sample estimates:
##       cor 
## 0.8254894
(gr_vs_sal <- cor.test(corr$GrLivArea, corr$SalePrice, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  corr$GrLivArea and corr$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6821200 0.7332695
## sample estimates:
##       cor 
## 0.7086245

The p-value in the three varaibles is below significance level of 0.05 - around 2.2e-16. This implies that the correlation falls within the c(0.8, 0.2) confidence interval and we failed to accept the null hypothese which assumes that the correlation is zero.

The 80% confidence interval

library(psychometric)
(CIr(r = tot_vs_sal$estimate, n = nrow(corr), level = 0.8))
## [1] 0.5092841 0.5573021
(CIr(r = gr_vs_sal$estimate, n = nrow(corr), level = 0.8))
## [1] 0.6915087 0.7249450
(CIr(r = gr_vs_tot$estimate, n = nrow(corr), level = 0.8))
## [1] 0.8144931 0.8358928

To calculate the familywize error (type-I error) \(FWE <= 1-(1-alpha)^c\) where:
1. \(alpha\) is the alpha level for an individual test - in this case it is 0.05. 2. \(c\) is the number of comparisons, which is 3.

perc <- (1 - (1 - 0.05)^3)
cat(sprintf("The probability of a type-I error is : %s%s\n", perc, " "))
## The probability of a type-I error is : 0.142625

Yes, I would be worry about that probability, which is very high considering only three tests were performed.

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

cor_mat <- data.matrix(cor(corr))
rownames(cor_mat) <- c()
colnames(cor_mat) <- c()
(cor_mat2 <- round(cor_mat, 2))
##      [,1] [,2] [,3]
## [1,] 1.00 0.53 0.71
## [2,] 0.53 1.00 0.83
## [3,] 0.71 0.83 1.00

The determinant of the correlation matrix not zero, the matrix has an inverse

library(matlib)
det(cor_mat2) # get determinant
## [1] 0.150758
(cor_inv <- inv(cor_mat2)) # get the inverse matrix
##            [,1]       [,2]      [,3]
## [1,]  2.0635721  0.3933456 -1.791613
## [2,]  0.3933456  3.2893777 -3.009459
## [3,] -1.7916131 -3.0094589  4.769896
(by_inv <- cor_mat2 %*% cor_inv)
##               [,1]    [,2]     [,3]
## [1,]  1.000000e+00 2.7e-09 -5.9e-09
## [2,]  9.000001e-10 1.0e+00 -6.9e-09
## [3,] -3.000000e-10 1.7e-09  1.0e+00
(res <- round(by_inv, 1))
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

get the inverse

library(matrixcalc)
# LU decomposition 

(cor_lu <- lu.decomposition(cor_mat2))
## $L
##      [,1]      [,2] [,3]
## [1,] 1.00 0.0000000    0
## [2,] 0.53 1.0000000    0
## [3,] 0.71 0.6309275    1
## 
## $U
##      [,1]   [,2]      [,3]
## [1,]    1 0.5300 0.7100000
## [2,]    0 0.7191 0.4537000
## [3,]    0 0.0000 0.2096482
# multiply L and U 
# 
(cor_lu$L %*% cor_lu$U) 
##      [,1] [,2] [,3]
## [1,] 1.00 0.53 0.71
## [2,] 0.53 1.00 0.83
## [3,] 0.71 0.83 1.00

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

fit_data <- train$MasVnrArea
fit_data <- fit_data[complete.cases(fit_data)]
hist(fit_data)

summary(fit_data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0   103.7   166.0  1600.0
length(fit_data[fit_data == 0])
## [1] 861

Out of the 1452 houses, there are 861 that have no masonry veneers (area is zero).

Because the data measures area, adding a value of .01 should be negligible and would get rid of the zero values. A property with a masonry veneer area of .01 square feet would mean this property does not really have any masonry veneer.

fit_data <- fit_data + .01
hist(fit_data)

summary(fit_data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.01    0.01    0.01  103.70  166.01 1600.01
library(MASS)
fit <- fitdistr(fit_data, densfun = "exponential")
# the optimal value
fit$estimate
##        rate 
## 0.009643642
exp <- rexp(length(fit_data), rate = fit$estimate) 
hist(exp)

qexp(.05, rate=fit$estimate)
## [1] 5.318872
qexp(.95, rate=fit$estimate)
## [1] 310.6432
norm.interval = function(data, variance = var(data), conf.level = 0.95) 
{
      z = qnorm((1 - conf.level)/2, lower.tail = FALSE)
      xbar = mean(data)
      sdx = sqrt(variance/length(data))
      c(xbar - z * sdx, xbar + z * sdx)
}

norm.interval(fit_data, variance=var(fit_data), conf.level = 0.95)
## [1]  94.38199 113.00853
quantile(x=fit_data, probs=c(.05, .95))
##     5%    95% 
##   0.01 456.01